Introduction
This is my report for the final project of the Coding 2: Web scraping course at the Central European University.
In this project I had to do two things:
- Download the top 100 tennis player historically
- Select the year 2010 and download all the player statistics
I was also advised to use the techniques that I learned during the course: write functions, use lapply, and rbindlist to create dataframes and create a report on the findings.
Preparatory steps
First, I clean the environment and install the required packages.
#clean environment
rm(list =ls())
#loading packages with pacman
if (!require("pacman")) {
install.packages("pacman")
}
pacman::p_load(rvest, data.table, xml2,kableExtra,ddplot,lubridate, ggrepel, gridExtra, readr, dplyr)
#load ddplot package for racing bar chart
remotes::install_github("feddelegrand7/ddplot", build_vignettes = TRUE)Ranking table historically
Creating the database
Next, I read in the base html and create a list from the drilldown of dates on the site, which I will use to create all the links to download the historical data.
#read in base website
t <- read_html("https://www.atptour.com/en/rankings/singles")
#creating function to remove unnecessary text
clean_up <- function(x) {
stringr::str_replace_all(x, "[\r\t\n]", "")
}
#get all the dates for which there is a list of top 100
dates <- t %>% html_nodes(".dropdown-holder-wrapper:nth-child(1) li") %>% html_text() %>% lapply(clean_up)
#first date is twice, and there are two dates for which there is absolutely no data on the site (they are bugs), so we remove these from the dates
dates <- dates[-1]
dates <- dates[dates != "1985.03.03"]
dates <- dates[dates != "1976.03.01"]
#creating the links
links <- paste0('https://www.atptour.com/en/rankings/singles?rankDate=', dates)Next, I create a function that gets the ranking for one date.
#write function that gets the top 100 tennis players for one date from atptour.com
#and returns with a dataframe of the ranking, move, country, player, age, tournaments played, points dropping and next best
get_top100 <- function(t_url) {
t <- read_html(t_url)
df <- t %>% html_nodes(xpath ='//*[@id="rankingDetailAjaxContainer"]/table') %>%
html_table()
df <- df[[1]]
df <- df %>% mutate("Country" = substr(t %>% html_nodes(".country-item") %>% html_nodes("img") %>%
html_attr("src"), 26, 28) %>% toupper())
df[["Ranking"]] <- as.numeric(df[["Ranking"]])
df[["Points Dropping"]] <- as.numeric(df[["Points Dropping"]])
df[["Move"]] <- as.numeric(df[["Move"]]) # change if NA to 0!!!
temp <- t %>% html_nodes(".move-cell") %>% html_children() %>% html_attrs()
df <- df %>% mutate(Move = case_when(temp[seq(1, length(temp), by = 2)] == "move-down" ~ Move * (-1), TRUE ~ Move))
df [["Date"]] <- t %>% html_node(".dropdown-label") %>% html_text() %>% lapply(clean_up)
df [["Date"]] <- as.Date(gsub(".", "-", df$Date,fixed = TRUE))
df [["Points"]] <- as.numeric(gsub(",", "", df$Points,fixed = TRUE))
names(df) <- c("ranking", "move","country", "player","age","points","tourn_played","points_dropping","next_best","date")
return(df)
}After we have the function, I use lapply and rbindlist to create the final dataframe.
#apply the function to all dates
list_of_dfs <- lapply(links, get_top100)
#bind the lists together to create the final dataframe
final_df <- rbindlist(list_of_dfs)
#Save the dataframe to a csv
write.csv(final_df,"data/final_df.csv", row.names = FALSE)The dataframe has 217669 record and its head looks like this:
#read in the dataframe
final_df <- read_csv("data/final_df.csv")
#show its top 5 rows
head(final_df,5) %>%
kbl() %>%
kable_styling("basic") %>%
kable_paper("hover", full_width = T)| ranking | move | country | player | age | points | tourn_played | points_dropping | next_best | date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | NA | SRB | Novak Djokovic | 34 | 11540 | 14 | 0 | 0 | 2021-12-13 |
| 2 | NA | RUS | Daniil Medvedev | 25 | 8640 | 23 | 0 | 0 | 2021-12-13 |
| 3 | NA | GER | Alexander Zverev | 24 | 7840 | 23 | 0 | 0 | 2021-12-13 |
| 4 | NA | GRE | Stefanos Tsitsipas | 23 | 6540 | 26 | 0 | 0 | 2021-12-13 |
| 5 | NA | RUS | Andrey Rublev | 24 | 5150 | 28 | 0 | 0 | 2021-12-13 |
Visualizations
To get some insights about this vast dataset that I scraped, I create some visuals. For starters, I decided to create a racing bar chart showing the evolution of the number of points by the top 50 tennis players over time (I decided to restrict to top 50 because to see the full 100 requires a large screen). This interactive visual gives the viewer an idea of the evolution of the best tennis players over time. Since the dataframe only contains points from 1996-08-12 onwards, this is the starting date for the racing bar chart.
Having seen this evolution of players over time on the racing bar chart, I got curious about who were the players that were able to climb up the most on this list and who were the ones that somehow achieved a great position on the ranking table but then fell down spectacularly. On this next bar chart, I visualize the top 10 players that moved up the table and the top 10 players that moved down the table. Pat Cash and Vitas Gerulaitis turned out to be the players that improved the most, climbing a total of more than 300 places up over their careers. At the other end of the distribution, Tom Okker, Paolo Bertolucci, Jaime Fillol Sr. and Roscoe Tanner all went in the opposite direction more than 100 places cumulatively.
Now that we have seen the evolution of tennis ranking players and the players that moved the most on the table, I decided that it was time to also see the main point of such tables: I visualized the best players of all time. For this, I calculated the average ranking by player over their careers. As can be seen on the chart the all time great (from 1973 to now) is Bjorn Borg, Roger Federer comes second, while Rafael Nadal managed to seal the third place. Novak Djokovic and Ivan Lendl come in the fourth and fifth places.
Finally, I decided to utilize another aspect of the data, namely that for each player from the flags I was able to get the nationalities of the players. I group the entire dataset by the nationalities and calculate the average ranking by their players. Having created this view of the data, I visualize the top and bottom 5 countries that had at least one player who made it to the top 100 from 1973 to now. What we can see on the chart is that interestingly Greece, Bulgaria and less surprisingly Switzerland have the lowest average scores, while Kenya, Bahrein and Montenegro still have a lot of work to do to become dominant in tennis.
Player statistics for top 100 in 2010
After downloading the entire database historically, I was required to choose one year and download all the player statistics. I chose the ranking table at the end of 2010, as my favorite player Rafael Nadal led the ranking table at the time.
Creating the database
First, I read in the base html with the given date and get all the relative links pointing to the top 100 players in the ranking table. Pasting the relative links with the base, I get the full links to all players of interest.
#read in base website
t2 <- read_html("https://www.atptour.com/en/rankings/singles?rankDate=2010-12-27")
#get the relative links to the players
rel_link <- t2 %>% html_nodes(".player-cell") %>% html_node("a") %>% html_attr("href")
#create the full links
links2 <- paste0("https://www.atptour.com", rel_link)Now that I have the links to the players, the exercise was to get all the information I can find for the players from these sites. To this end, I created a gigantic function that does just that, from online social media, ranking, nationality and birthday to height and weight, I also scraped the statistics for the careers of the players and for the still active ones their statistics for 2021. I do not show this function in this report (it is simply too long).
With the help of this super function, it takes now only two lines of code to create the second final dataframe contatining detailed information for all tennis players that made into the top 100 ranked at the end of 2010.
#apply the function on the links
list_of_dfs2 <- lapply(links2, get_player)
#bind the lists for all dates in a dataframe
final_df2 <- rbindlist(list_of_dfs2)
#write to csv
write.csv(final_df2,"data/final_df2.csv", row.names = FALSE)The dataframe can then be read and it looks like this (scroll to see all columns):
| name | current_rank | nationality | youtube | website | birthday | age | turned_pro | weight_kg | weight_lbs | height | height_cm | birth_city | birth_country | handed | style | coach(es) | rank2021 | win2021 | lose2021 | titles2021 | prizemoney2021 | bestrank | bestranktime | win | lose | titles | prizemoney | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rafael Nadal | 6 | ESP | https://www.facebook.com/Nadal | https://www.instagram.com/rafaelnadal/ | https://twitter.com/RafaelNadal | NA | http://www.rafaelnadal.com | 1986-06-03 | 35 | 2001 | 85 | 187 | 6’1 | 185 | Manacor | Mallorca | Left-Handed | Two-Handed Backhand | Carlos Moya, Francisco Roig | 6 | 24 | 5 | 2 | 1478830 | 1 | 2008-08-18 | 1028 | 209 | 88 | 124961595 |
| Roger Federer | 16 | SUI | https://www.facebook.com/Federer | https://www.instagram.com/rogerfederer/ | https://twitter.com/rogerfederer | NA | http://www.rogerfederer.com | 1981-08-08 | 40 | 1998 | 85 | 187 | 6’1 | 185 | Basel | Switzerland | Right-Handed | One-Handed Backhand | Ivan Ljubicic, Severin Luthi | 16 | 9 | 4 | 0 | 647655 | 1 | 2004-02-02 | 1251 | 275 | 103 | 130594339 |
| Novak Djokovic | 1 | SRB | https://www.facebook.com/djokovicofficial/ | https://www.instagram.com/djokernole/ | https://twitter.com/DjokerNole | https://www.youtube.com/djokovicofficial | http://novakdjokovic.com | 1987-05-22 | 34 | 2003 | 77 | 170 | 6’2 | 188 | Belgrade | Serbia | Right-Handed | Two-Handed Backhand | Marian Vajda, Goran Ivanisevic | 1 | 55 | 7 | 5 | 9069225 | 1 | 2011-07-04 | 989 | 199 | 86 | 154756726 |
| Andy Murray | 134 | GBR | https://www.facebook.com/andymurrayofficial/ | https://www.instagram.com/andymurray/ | https://twitter.com/andy_murray | NA | http://www.andymurray.com | 1987-05-15 | 34 | 2005 | 82 | 181 | 6’3 | 191 | Glasgow | Scotland | Right-Handed | Two-Handed Backhand | Jamie Delgado | 134 | 15 | 14 | 0 | 514248 | 1 | 2016-11-07 | 691 | 214 | 46 | 62314306 |
| Robin Soderling | NA | SWE | NA | NA | https://twitter.com/RSoderling | NA | http://robinsoderling.se | 1984-08-14 | NA | 2001 | 87 | 192 | 6’4 | 193 | Tibro | Sweden | Right-Handed | Two-Handed Backhand | NA | NA | NA | NA | NA | NA | 4 | 2010-11-15 | 310 | 170 | 10 | 10423124 |
Visualizations
This dataset is a much more detailed one, with plenty of interesting insights to potentially show. I decided to limit these to three especially interesting ones.
For the first, I decided to create a new variable that counts the number of social media platforms that a user has. The role social media platform plays in the value and success of sport players became an interesting topic in sport sciences over the past couple of years and so I was curious of the distribution. Of course the number of followers would probably tell even more, but I still found it interesting to look at how common it is among tennis players to build their brands in various different platforms. As can be seen on the histogram, on average they use 2.31 sites, but some use 5 different ones and others do not have any.
For the second plot, I again was curious about a distribution, but now I wanted to see at what age players reach their top position on the ranking table. For this, I again needed to create a new variable, in which I subtracted the birthdate of the players from the time when they were at their peak in terms of place in the ranking table.
As can be seen on the figure, tennis players generally tend to reach their peak at around 27 years of age, but some interesting extreme values are also worthwhile to note: There are some players who reached their peak at 20, while for some others it took much more time, way into their 30s up to 36 years of age.
Finally, I also wanted to create a plot that shows the relationship between variables in the dataframe. To this end, I visualize a scatter plot with the number of wins the players have had over their career on the x axis and the prizemoney they collected. The points are then colored based on the style they play. What we can see on this chart is not atypical of single player sports. There are some players who are way above the others, both in number of wins and money collected. The notable ones are Roger Federer, who has the most wins and Novak Djokovic, who collected the most money, but also Rafael Nadal and Andy Murray are notable names. Among the rest, there is also a clear positive relationship between the number of wins and prize money collected. Regarding the style, most players play with two0handed backhand, but there are some who lpay one-handed backhand (the most notale being Federer), and for some it is unknown.
Conclusions
To conclude, in this project, I downloaded the top 100 tennis player historically and selected the year 2010 and downloaded also all the player statistics for that year. Using some interesting visuals, I was also able to learn about the data some interesting insights:
I looked at the evolution of the ranking table over time with the help of a racing bar chart. Then, I learned about who were the players that moved the most upwards and downwards on the ranking table throughout their careers (Pat Cash moved most up, while .. moved the most down). Next, I also looked at who are the best players of all time based on average ranking over their careers (Bjorn Borg tops the list). After this, I looked at the best and worst performing countries in tennis (interestingly Greece came on top, while Kenya is at the bottom).
For the detailed player statistics of the 2010 ranking table, I also got some insights through visuals. I learned that players on average have around 2-3 social media sites, and reach their peak performance at around the age of 27. Finally, I also looked at the relationship between the number of wins by players and the prize money they collected. There is a clear positive relationship with some exceptional players having much more wins and also much more prize money.